Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures

Authors

  • Mario Lucic
  • Olivier Bachem
  • Andreas Krause
Abstract

Coresets are efficient representations of data sets such that models trained on the coreset are provably competitive with models trained on the original data set. As such, they have been successfully used to scale up clustering models such as K-Means and Gaussian mixture models to massive data sets. However, until now, the algorithms and the corresponding theory were usually specific to each clustering problem. We propose a single, practical algorithm to construct strong coresets for a large class of hard and soft clustering problems based on Bregman divergences. This class includes hard clustering with popular distortion measures such as the squared Euclidean distance, the Mahalanobis distance, the KL divergence and the Itakura-Saito distance. The corresponding soft clustering problems are directly related to popular mixture models due to a dual relationship between Bregman divergences and exponential family distributions. Our theoretical results further imply a randomized polynomial-time approximation scheme for hard clustering. We demonstrate the practicality of the proposed algorithm in an empirical evaluation.
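The distortion measures named in the abstract are all instances of one formula: for a strictly convex, differentiable generator phi, the Bregman divergence is d_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>. The following is a minimal sketch of that standard definition, not code from the paper; the function names and the example vectors are assumptions made for illustration.

```python
import numpy as np

def bregman_divergence(phi, grad_phi, p, q):
    """d_phi(p, q) = phi(p) - phi(q) - <grad_phi(q), p - q>."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return phi(p) - phi(q) - np.dot(grad_phi(q), p - q)

# Generator for the squared Euclidean distance: phi(x) = ||x||^2
def sq_norm(x):
    return np.dot(x, x)

def sq_norm_grad(x):
    return 2.0 * x

# Generator for the (generalized) KL divergence: phi(x) = sum x_i log x_i
def neg_entropy(x):
    return np.sum(x * np.log(x))

def neg_entropy_grad(x):
    return np.log(x) + 1.0

# Generator for the Itakura-Saito distance: phi(x) = -sum log x_i
def burg(x):
    return -np.sum(np.log(x))

def burg_grad(x):
    return -1.0 / x

# Hypothetical example points (strictly positive, summing to one)
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

sq = bregman_divergence(sq_norm, sq_norm_grad, p, q)
kl = bregman_divergence(neg_entropy, neg_entropy_grad, p, q)
is_dist = bregman_divergence(burg, burg_grad, p, q)
```

Because p and q both sum to one here, the generalized KL value coincides with the usual KL divergence. Note that Bregman divergences are in general not symmetric, so the argument order matters.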

Similar resources

Clustering with Bregman Divergences

A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clust...
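As a rough illustration of centroid-based hard clustering with a Bregman divergence, the sketch below runs a Lloyd-style loop: assign each point to the closest center under the divergence, then move each center to the arithmetic mean of its points. It relies on the known property that the mean minimizes the total Bregman divergence to a set of points. This is not the paper's implementation; `bregman_hard_clustering`, `generalized_kl`, and the toy data are hypothetical.

```python
import numpy as np

def generalized_kl(p, q):
    """Generalized KL divergence, an illustrative choice of Bregman divergence.

    Works row-wise when p is an (n, d) array and q is a (d,) vector.
    """
    return np.sum(p * np.log(p / q) - p + q, axis=-1)

def bregman_hard_clustering(X, init_centers, divergence, n_iter=20):
    """Lloyd-style alternation; assumes n_iter >= 1."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        # Assignment step: closest center under divergence(point, center)
        dists = np.stack([divergence(X, c) for c in centers], axis=1)
        labels = np.argmin(dists, axis=1)
        # Update step: arithmetic mean of each non-empty cluster
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels

# Hypothetical toy data: four points on the probability simplex
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
centers, labels = bregman_hard_clustering(X, X[[0, 2]], generalized_kl)
```

Swapping `generalized_kl` for another divergence with the same calling convention changes the assignment step only; the mean update stays the same.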

Simplification and hierarchical representations of mixtures of exponential families

A mixture model in statistics is a powerful framework commonly used to estimate the probability measure function of a random variable. Most algorithms handling mixture models were originally specifically designed for processing mixtures of Gaussians. However, other distributions such as Poisson, multinomial, Gamma/Beta have gained interest in signal processing in the past decades. These common ...

Parameter Estimation in Finite Mixture Models by Regularized Optimal Transport: A Unified Framework for Hard and Soft Clustering

In this short paper, we formulate parameter estimation for finite mixture models in the context of discrete optimal transportation with convex regularization. The proposed framework unifies hard and soft clustering methods for general mixture models. It also generalizes the celebrated k-means and expectation-maximization algorithms in relation to associated Bregman divergences when applied to e...

Scalable and Distributed Clustering via Lightweight Coresets

Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally only allow for multiplicative approximation errors, we propose a novel notion of coresets called lightweight cor...

Algorithms for the Bregman k-Median problem

In this thesis, we study the k-median problem with respect to a dissimilarity measure $D_\varphi$ from the family of Bregman divergences: Given a finite set $P$ of size $n$ from $\mathbb{R}^d$, our goal is to find a set $C$ of size $k$ such that the sum of errors $\mathrm{cost}(P, C) = \sum_{p \in P} \min_{c \in C} D_\varphi(p, c)$ is minimized. This problem plays an important role in applications from many different areas of computer science, such as info...
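The k-median objective described above can be evaluated directly: for every point, take the smallest divergence to any center and sum the results. The sketch below is illustrative only; it uses the squared Euclidean distance as one concrete Bregman divergence, and `clustering_cost` and the toy inputs are hypothetical names and values.

```python
import numpy as np

def squared_euclidean(p, c):
    """Squared Euclidean distance, one member of the Bregman family."""
    return float(np.sum((p - c) ** 2))

def clustering_cost(P, C, divergence):
    """Sum over p in P of the minimum over c in C of divergence(p, c)."""
    return sum(min(divergence(p, c) for c in C) for p in P)

# Hypothetical toy instance with n = 3 points and k = 2 centers
P = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
C = np.array([[0.0, 0.0], [10.0, 0.0]])

total = clustering_cost(P, C, squared_euclidean)
```

The point is passed as the first argument and the center as the second, which matters for asymmetric divergences.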

Publication date: 2016